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Abstract 

A new inverse iteration algorithm that can be used to compute all the eigenvectors of 
a real symmetric tri-diagonal matrix on parallel computers is developed. The mod- 
ified Gram-Schmidt orthogonalization is used in the classical inverse iteration. This 
algorithm is sequential and causes a bottleneck in parallel computing. In this paper, 
the use of the compact WY representation is proposed in the orthogonalization process 
of the inverse iteration with the Householder transformation. This change results in 
drastically reduced synchronization cost in parallel computing. The new algorithm is 
evaluated on both an 8-core and a 32-core parallel computer, and it is shown that the 
new algorithm is greatly faster than the classical inverse iteration algorithm in comput- 
ing all the eigenvectors of matrices with several thousand dimensions. 

Keywords: inverse iteration, orthogonalization, compact WY representation, 
eigenvalue problem, parallelization, Householder transformation 



1. Introduction 

The eigenvalue decomposition of a symmetric matrix, i.e., a decomposition into a 
product of matrices consisting of eigenvectors and eigenvalues, is one of the most im- 
portant operations in linear algebra. It is used in vibrational analysis, image processing, 
data searches, etc. 

Let us note that the eigenvalue decomposition of real symmetric matrices is re- 
duced to that of real symmetric tri-diagonal matrices. Owing to recent improvements 
in the performance of computers equipped with multicore processors, we have had 
more opportunities to perform computation on parallel computers. As a result, there 
has been an increase in demand for an eigenvalue decomposition algorithm that can be 
effectively parallelized. 

The inverse iteration algorithm is an algorithm for computing eigenvectors inde- 
pendently associated with mutually distinct eigenvalues. However, when we use this 
algorithm, we must reorthogonalize the eigenvectors if some eigenvalues are very close 
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to each other. Adding this reorthogonalization process increases the computational 
cost. For this reorthogonalization, we have generally used the MGS (modified Gram- 
Schmidt) algorithm. However, this algorithm is sequential and inefficient for parallel 
computing. As a result, we are unable to maximize the performance of parallel com- 
puters. Hereinafter, we will refer to the inverse iteration algorithm with MGS as the 
classical inverse iteration. 

We can also orthogonalize vectors by using the Householder transformation [10] 
and we call this precess the Householder orthogonalization algorithm. While the MGS 
algorithm is unstable in the sense that the orthogonality of the resulting vectors cru- 
cially depends on the condition number of the matrix 111 111 , the Householder algorithm 
is stable because its orthogonality does not depend on the condition number. The 
Householder algorithm is also sequential and ineffective for parallel computing, and 
its computational cost is higher than that of MGS. 

In 1989, the Householder orthogonalization in terms of the compact WY represen- 
tation was proposed by R. Schreiber et al By adopting this orthogonalization, sta- 
bility and effective parallelization can be achieved. Hereafter, we refer to this algorithm 
as the compact WY orthogonalization algorithm. Yamamoto et al. Ill ill reformulated 
this algorithm for an incremental orthogonalization. Moreover, They showed that this 
algorithm achieves theoretically high accurate orthogonality and high scalability in par- 
allel computing [ 1 JJQ . Here, the incremental orthogonalization is implemented on many 
numerical computation library. LAPACK(Linear Algebra PACKage) 10] is one of the 
most popular libraries and all the code of LAPACK is implemented by using BLAS 
(Basic Linear Algebra Subroutines ) operations. The compact WY orthogonalization 
algorithm can be implemented by using BLAS. 

In (a], authors have implemented the compact WY orthogonalization to the re- 
orthogonalization process of inverse iteration for computing eigenvectors of a tri-diagonal 
matrix. It is shown ffl that, in parallel computing, the new inverse iteration algorithm 
is faster than the classical one. 

In this paper, we present two implementations: One is a new implementation of the 
compact WY orthogonalization algorithm based on BLAS. We focus on a mathematical 
structure of this algorithm and reformulate this algorithm. Therefore, using this new 
implementation, the computational cost of the compact WY orthogonalization can be 
reduced. The other is an implementation of the compact WY orthogonalization to the 
inverse iteration algorithm for a real symmetric tri-diagonal matrix. Thereafter, we 
perform the numerical experiments by computing all the eigenvectors using the second 
implementation and evaluate its performance. 

2. Classical inverse iteration and its defect 

2.1. Classical inverse iteration 

We consider the problem of computing eigenvectors of a real symmetric tri-diagonal 
matrix T 6 R" x ". Let Ay Elbe eigenvalues of T such that Ai < A2 < • • • < A„. Let 
Vj £ M." be the eigenvector associated with Ay. When A/, an approximate value of Ay, 
and a starting vector are given, we can compute an eigenvectors of T. To this end, 
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Alg. 1 Classical inverse iteration 



for j = 1 to n do 

Generate from random numbers. 

k = 0. 

repeat 

k<-k+l. 

Normalize Vj . 

Solve (T - A/) vf = vf~V (Eq.CQ). 

if then|Xy-A ; -_il ^ l0 ~ 3 \\ T \l 
for i = 71 to 7' - 1 do 

V) ' <-y) >-{v) ',Vi)Vi 

end for 

else 

h = i- 

end if 

until some condition is met. 
Normalize vy to v,-. 
end for 



we solve the following equation iteratively: 

(T-ljl)vf =vf" 1) . (1) 

Here / is the n-dimensional identity matrix. If the eigenvalues of T are mutually well- 
separated, Vj , the solution of Eq.(fl]i, generically converges to the eigenvector associ- 
ated with Xj as k goes to °°. The above iteration method is the inverse iteration. The 

computational cost of this method is of 0(mn) when we compute m eigenvectors. In 

(k) 

the implementation, we have to normalize the vectors Vj to avoid overflow. 

When some of the eigenvalues are close to each other or there are clusters of eigen- 
values of T, we have to reorthogonalize all the eigenvectors associated with such eigen- 
values because they need to be orthogonal to each other. In the classical inverse itera- 
tion, we apply the MGS to this process and the computational cost of it is of 0(m 2 n). 
Therefore, when we compute eigenvectors of the matrix that has many clustered eigen- 
values, the total computational cost increases significantly. In addition, the classical in- 
verse iteration is implemented the Peters- Wilkinson method [8]. In this method, when 
the distance between the close eigenvalues is less than 10 _3 ||r||, we regard them as 
members of the same cluster of eigenvalues, and we orthogonalize all of the eigen- 
vectors associated with these eigenvalues. The classical inverse iteration algorithm is 
shown by Alg[T] and j\ denotes the index of the minimum eigenvalue of some cluster. 
This algorithm is implemented as DSTEIN in LAPACK 

2.2. The defect of the classical inverse iteration 

The inverse iteration is a prominent method for computing eigenvectors, because 
we can compute eigenvectors independently. When there are many clusters in the dis- 
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Alg. 2 Householder orthogonalization 
l: for j = 1 to m do 

2: uj-t- (i-tmyj)vj 

3: for i = 2 to j — 1 do 
4: 8y«- (i-tiyiyj)uj 

5: end for 

6: Compute yj and f ; by using uj 

7: (i-tjyjy})*j 

8: for / = j — 1 to 1 do 

9: 9j<- {i-taaDij 

10: end for 
li: end for 



tribution of eigenvalues, the inverse iteration can be parallelized by assigning each 
cluster to each core. 

Let us consider the Peters-Wilkinson method in the classical inverse iteration. When 
the dimension of T is greater than 1000, most of the eigenvalues are regarded as being 
in the same cluster yfl. In this case, we have to parallelize the inverse iteration with 
respect to not the cluster but the loop described from lines 2 to 16 in AlgQ] This loop 
includes the iteration based on Eq.(Q~|i and the orthogonalization of the eigenvectors. 
This orthogonalization process becomes a bottleneck of the classical inverse iteration 
with respect to the computational cost. The MGS algorithm is mainly based on a BLAS 
level- 1 operation and it is a sequential algorithm. Because of this, when we compute 
all the eigenvectors on parallel computers, the number of synchronizations is of 0(m 2 ). 
Therefore, the MGS algorithm is ineffective in parallel computing. 

In conclusion, the classical inverse iteration is an ineffective algorithm for parallel 
computing because the MGS algorithm is used in its orthogonalization process. 

3. Other orthogonalization algorithms 

In this section, we introduce alternative orthogonalization algorithms instead of 
the MGS algorithm. Now, we discuss the incremental orthogonalization of vj € M." 
to qj € W (j = 1, . . . , m, m < n). The incremental orthogonalization arises in the 
reorthogonalization process on the inverse iteration and it is defined as follows: vj 
(2 < j < m) is not given in advance but is computed from q\, qj-i- 

In the following, Let us define a vector 0, as the /-dimensional zero vector and 
matrices V, Q G R" xm as V = [vi ■■■ v m ],Q = [q 1 ■■■ q m }. 

3.1. Householder orthogonalization 

The Householder orthogonalization, based on the Householder matrices, is one of 
the alternative orthogonalization methods. When vectors Uj, Wi eR° (J = 1, . . . , m) 
satisfy ||k/||2 = ||w/||2, there exists the orthogonal matrices Hj called the Householder 
matrices satisfying HjHj = HjHj = I, HjUj = Wj defined by Hj = I — tjyjyj , yj — 
Uj—Wj, tj = 2/\\yj\\\. The transformation from Uj to vj by Hj is called the Householder 
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transformation. By using the Householder transformations. This orthogonalization 
algorithm is shown in Algf2] The vector y j is the vector in which the elements from 
1 to (j — 1 ) are the same as the elements of uj and the elements from (j + 1) to n are 
zero. The vectors uj and Wj are defined as follows: 



u jJ u j+iJ "' u n,j\ 




= Hj- 1 H j -2-H 2 H l v J , 

w j=[ u U u J-hj C J °n-j 

where Ufj (i= 1, ...,«) is the /-th element of uj and 

Cj = -sga(ujj 
Here, yj and f ; are computed as follows: 

yj = »j-»j= [°J- i u jj - c i u j+ u • • • M » T i o = ]rn^ • ( 2 ) 

The vector e ; in Alg|2]is the j-th vector of an n-dimensional identity matrix. 

The orthogonality of the vectors qj generated by the Householder orthogonaliza- 
tion does not depend on the condition number of V. Therefore, the Householder or- 
thogonalization is more stable than MGS. On the other hand, being similar to MGS, 
it is a sequential algorithm, that is mainly based on a BLAS level- 1 operation. Its 
computational cost is about twice higher than that of MGS. Thus the Householder or- 
thogonalization is an ineffective algorithm for parallel computing. 

3.2. Compact WY orthogonalization 

In 1989, the Householder orthogonalization in terms of the compact WY repre- 
sentation was proposed by Schreiber and van Loan H. Yamamoto and Hirota it 1 1 M 
reformulated this algorithm for the incremental orthogonalization. This study suggests 
that the Householder orthogonalization becomes capable of computation with a BLAS 
level-2 operation in terms of the compact WY representation. They also showed that 
this algorithm achieved theoretically high orthogonality and high scalability in parallel 
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Now, we consider the Householder orthogonalization in Algf2]and we introduce the 
compact WY representation. First, we define Y\ = \y{\ 6 R MXl and 7J = [t{\ £ R lxl . 
Let us define matrices Yj 6l" x ' and upper triangular matrices Tj £ W XJ recursively 
as follows: 



Yj = [Yj-i Jj\ , Tj = 



°J-i 



(3) 



In this case, the following equation holds 

H 1 H 2 :.H j =I-YjTjYT. (4) 

As shown in Eq.©, we can rewrite the product of the Householder matrices H\H2--- Hj 
in a simple block matrix form. Here / — YjTjYj is called the compact WY representa- 
tion of the product H\H2 ■■■Hj of the Householder matrices. Algf3]shows the compact 
WY orthogonalization algorithm. 
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Alg. 3 compact WY orthogonalization algorithm 



1: Compute j>i and t\ by using u\ = v\ 

2: Yi = fa], T\ = [ h ] 

3: q^il-YxW^ej 

4: for j = 2 to m do 

5: "r (i Y J > 7 V 
6: Compute y / and ? ; - by using uj 

t- Yj = \Yj~\ .v,;./) r ' 



'7-1 





8: qj 

9: end for 



I-YjTjYAej 



-tjTj-xYj^yi 
ti 



3.3. Implementation of compact WY orthogonalization 

In this subsection, we discuss the implementation of the compact WY orthogo- 
nalization algorithm using BLAS operations. In addition, we discuss a mathematical 
structure of this algorithm and present a new implementation of the compact WY or- 
thogonalization for reducing the computational cost and the usage of memory. 

3.3.1. Ordinary implementation of compact WY orthogonalization using BLAS 

Now we discuss the implementation of the compact WY orthogonalization based 

on line 5 to 8 in Alg [3] using BLAS operations. 

For the adaptation of BLAS operations, we have to reformulate the formula of line 

5 as follows: 

"i (' *i i>; :)'•/ 

= vj-Y J -iT/_ 1 Yl. 1 Vj 

Now we can implement this formula by using BLAS as follows: 

(uj^Vj (DCOPY) 
I v'j_ j <— Fjljiij + • v'j_ , (DGEMV) 

( v 'j-i <" T 7-i v 'j-i (DTRMV) ' 

uj 4r- (-1) -Yj^v'j^ +uj (DGEMV) 

where v'j_ 1 £ If' -1 . We set the initial address of v'_j assigned on CPU memory to 
correspond to that of Vj. DCOPY denotes the copying operation of a vector x to a 
vector y : y <— x. DGEMV means the matrix-vector operation: y ^— aAx + fiy, where 
A is a general rectangular matrix. DTRMV denotes the matrix-vector product: x ^— Tx, 
where T is a triangular matrix. 

Next, on line 6, we compute yj and tj based on Eq.©. These computations is 
mainly performed by using BLAS level- 1 operations and its computational cost is rel- 
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atively lower, we implement the computation of yj and tj as follows: 



v,,- 0. = 1,-.., 7-1) 

yi,j*-%j, (> = j> •••>«) 



(DCOPY) 

yj,J <~ ■ - '•/• c i = - sgn("yj) \ (DNRM2) 
f 7 ^2/|b,||2 ' (DNRM2) 



where yy (i = 1, . . . , ri) is the z'-th column element of yj. DNRM2 denotes the compu- 
tation of the 2-norm of a vector. 

On line 7, updating Yj and tj can be done easily. Now, let tj E I 
—tjTj-iYjitfj. Note that tj is implemented by using BLAS as follows: 



be tj = 



(DGEMV) 
(DTRMV) 



At last, on line 8, we can reformulate as follows: 

= ej-Y j T J Y/e J . 
Here, the matrix- vector product Yjej can be simplified as follows: 



\y.i,j. 



. This computation can be performed only by copying the y'-th column of Yj to some 
vector. Therefore we can implement the formula of line 8 using BLAS as follows: 



1i <- e i 



yjj 



j 

9j 



Tjv'j 

(-i).y/+^ 



(DCOPY) 
(DCOPY) 

(DTRMV) 
(DGEMV) 



where v'j G W, qj G R". We set the initial address of v'j, qj assigned on CPU memory 
to correspond to that of Uj, Vj, respectively. 

The computational cost of the above compact WY orthogonalization algorithm is 
almost 4m 2 n + m 3 . In the worst case, i.e., m~n, the computational cost is 5n 3 . 

In addition, for this implementation, we have to use almost mn + m 2 CPU memory 
because Y m use mn and T m use m 2 domain. 

3.3.2. New implementation of compact WY orthogonalization using BLAS 

In the above section, we discuss the ordinary implementation of the compact WY 
orthogonalization algorithm. Now we focus on the mathematical structure of this algo- 
rithm and present the new implementation of the compact WY orthogonalization which 
has the less computational cost than the ordinary one has. 



7 



Before the formula of line 5 in Algf3] let us consider the formula of line 6. From 
Eq.©, we can strictly compute tj as follows: Since 



we have 



Hence, we have 



cj = -sgn(ujj) 



\\yj\\ 2 2 = (uj,j-cj) 2 + L ujj 

'=i+i 

it 

E";- / 2 "> j c > • <"/ 

2 ( c j »/,/'•/ 



lb; 



c j u j,j c i 



From this fact and the definition of yj and Cj, we need not compute the elements from 
1 to (j — 1) of « ; in actual. Therefore we compute only the elements from j to n of Uj 
so that the formula of line 5 is reduced as follows: 

a j = a j-Yj-iT J J _ l Y/_ l Vj, 



where uj S R" - ' 7 ' -1 ' is uj = [ujj ■■■ . 

Here, we focus on the structure of yj. From Eq.©, yj (J = 2, m) can be 
represented as the block vector of the form: 

0,-1 

h 

where y j 6 R" - ^ -1 ) is the vector of nonzero elements of y ,-. From this fact, Yj can be 
represented as the following block matrix: 



where Lj G W x i is a lower triangular matrix and Yj e R^" ■')*■' is generally a dense 
rectangular matrix. In addition, let us consider v ; as the block vector of the form: 



where V j E R^ 1 , Vj e R""^ 1 ). 
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By using these block form of vj and Yj, we can reduce the computational cost of 
the matrix- vector product Yj_ x Vj through 



Yj-i 



'■j ft ■ V- ,>v 



Therefore, the formula of uj can be simplified as follows: 

&j=&j-Yj. l Tl_ l (Lj_ l v j + Yj_ 1 v J ). 
This formula can be implemented by using BLAS as follows: 



(DCOPY) 
(DTRMV) 
(DGEMV) . 
j, (DTRMV) 

uj <- (-1)-Yj_ l v j + Uj (DGEMV) 



9 J ^ + j 



From the above discussion, the computation on line 6 is implemented by using 
BLAS as follows: 



v;,/ < (i = j, •••,«) 



(DCOPY) 



yjj <~ UJJ - '•/• cj = - sgn :»,., :• ^ If ; «?, (DNRM2) . 
tj^ l /( c2 j- u jJ c j) 
On line 7, we can also reduce the computational cost of ij through 

h = -'j T .i \ Y i \y) 



vr 


T 


0,-i 









=-0' 7 ;-i(^7-iO;-i+^i^) 

This formula can be implemented by using BLAS as follows: 
/ tj <- (-tj^Jj + • tj (DGEMV) 



Tj-itj 



(DTRMV) 



At last, on line 8, even if the sign of the orthogonal vector qj is reversed, the 
orthogonality along with other vectors is not changed. Therefore, we can reformulate 
qj as qj = {jjTjYj — I^jej. In addition, let us consider qj as the following block 
vector: 



Qj 



9 



. I , m + 1 




Figure 1 : Assignment model for Yj and Tj 



where qj e R 7 , qj S R" ; . These are reformulated as follows: 







'LjTjYjei 






9j. 




YjTjYje L 







where Sj is the y'-th vector of the /'-dimensional identity matrix. Therefore this formula 
can be implemented by using BLAS as follows: 



yj,i 
x j <- T J x i 
h <- *j 

ij <~ Y i x i 
{4]J*-4jJ 



yjj 



-Oqj 
- 1 



(DCOPY) 

(DTRMV) 
(DCOPY) 
(DTRMV) 
(DGEMV) 



where Xj € 



; is assigned on workspace memory. 
When the above implementation is adapted, the highest order of the computational 
cost of the compact WY algorithm reduced to 4m 2 n — m 3 . In the worst case, i.e., m = n, 
the computational cost of the new implementation of the compact WY algorithm is 
almost 3n 3 . 

In addition, our implementation have not to be referred any zero elements of Yj 
and Tj. Therefore, if Yi and Tj are assigned on a CPU memory like Alg[T] the use of 
memory can be reduced to almost n(m+ 1), 

3.4. Comparison of the orthogonalization algorithms 

The compact WY orthogonalization has a stable orthogonality arising from the 
Householder transformations, and its numerical computation is mainly performed by 
BLAS level-2 operations. As a result, this orthogonalization has a better stability and a 
sophisticated orthogonality, and it is more effective for parallel computing than MGS. 
TableQ]displays the differences in performance of the orthogonalization methods men- 
tioned above. In this table, Computation denotes the order of the computational cost. 
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Table 1: Comparison of the orthogonalization methods [1] I'll] 



orthogonalization 


Computation 


Synchronization 


Orthogonality 


MGS 


2m 1 n 


0(m l ) 


0(BK(V)) 


Householder 


4m 2 n 


0(m 2 ) 


O(B) 


compact WY 


4m 2 n + m 3 


0(m) 


0(e) 


new compact WY 


4m 2 n — m 3 


0{m) 


0(e) 



Synchronization means the order of the number of synchronizations. Orthogonality 
indicates the norm \\Q T Q — 1\\ and e denotes the machine epsilon and k(V) is the con- 
dition number of V. 

4. Inverse iteration algorithm with compact WY orthogonalization 

Authors have proposed an alternative inverse iteration algorithm in |fj. This algo- 
rithm is based on the classical inverse iteration algorithm implemented in DSTEIN and 
we change the orthogonalization process of it from MGS to the compact WY orthogo- 
nalization that is described on Sec. 3.3.1. In addition, it is shown that this algorithm is 
faster than the classical inverse iteration one in parallel computing flg]. 

Now we present an even faster inverse iteration algorithm with the compact WY 
orthogonalization. This compact WY orthogonalization is implemented on the way 
of Sec. 3.3.2. The new algorithm is described in Alg|4] Let us name the new code 
DSTEIN-cWY. 

Next, we explain an application of the new implementation of the compact WY 
orthogonalization to the inverse iteration. Differences between DSTEIN-cWY and 
DSTEIN is as follow: For the classical inverse iteration algorithm, we need not know 
the index j c which denotes the j c -th eigenvalue of the cluster in computing the eigen- 
vector associated with it. However, we must know the index for the compact WY 
orthogonalization when we compute and update Tj, Yj. To overcome the above diffi- 
culty, we introduce a variable j c on line 9, and we can recognize it. This introduction 
of j c enables us to execute the intended program. 

In the classical inverse iteration algorithm, we need not know the first eigenvalue 
Ay, of the cluster. However, we must compute y \ and t \ in the new inverse iteration 
algorithm. Therefore, at the starting point of the computation of the eigenvector asso- 
ciated with the second eigenvalue + we compute T\ = [t\], Y\ = [y\] by using v /r 
At this time, because v u is a normalized vector so that it equals to (/ — YiTiY^)e\ , we 
need not compute v it again. 

5. Numerical experiments 

We describe some numerical experiments performed by using DSTEIN and DSTEIN- 
cWY on parallel computers, and we compare the computation time. Here DSTEIN of 
LAPACK is based on the classical inverse iteration, and DSTEIN-cWY makes use of 
the new inverse iteration presented in the previous section. 
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Alg. 4 compact WY inverse iteration 



for j = 1 to n do 

Generate v^ ' from random numbers. 

k = 

repeat 



Normalize Vj . 
Solve ( T - 



(r-v)»y-»s 



Ct-i) 



U\Xj-Xj-i\ < l(T j ||r||, then 

jc^j- ji- 
lt j c = 1 and k = 1, then 

Compute Ti = [yi] and Ti = [fi] by using vy, . 
end if 



Compute yj c+ i and fy c+1 by using Uj c+ \. 



15: Y jc+1 =[Y jc y jc+1 ],Tj c+1 = 

v[ k) <r- (i-Y. .,r f ,,y, T 



else 

jfi j- 
end if 

until Some condition is met. 
Normalize v, to v,-. 



end for 



5.1. Contents of the numerical experiments 

We report computations of all the eigenvectors associated with eigenvalues of some 
matrices by using DSTEIN and DSTEIN-cWY on parallel computers, and we compare 
the elapsed time. In these experiments, we compute the approximate eigenvalues by 
using LAPACK's program DSTEBZ, which is capable of computing eigenvalues using 
the bisection method. We record the elapsed time for DSTEIN and DSTEIN-cWY 
using SYSTEM_CLOCK, which is the internal function of Fortran. 

In the experiments, we use two computers equipped with multicore CPUs, and 
we implement those algorithms by using GotoBLAS2 [5], which is implemented to 
parallelize BLAS operations by assigning them to each CPU core. Table [2] shows 
the specifications of two computers. As experimental matrices, we use symmetric tri- 
diagonal matrices of three types. Type 1 is a tri-diagonal random matrix, of which 
elements are set to the random number of [0, 1). It is shown that the eigenvalues of 
a tri-diagonal random matrix are divided into a few clusters in the sense of Peters- 
Wilkinson method] 8]. and most of eigenvalues are included in the biggest one of the 
clusters if the dimension n of a random matrix becomes larger. The tri-diagonal matrix 
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Table 2: The specification of Computer 1 and 2 





Computer 1 


Computer 2 


CPU 


AMD Opteron 2.0GHz 


Intel Xeon 2.93GHz 


32cores(8coresx4) 


8cores(4coresx2) 


RAM 


256GB 


32GB 


Compiler 


Gfortran-4.4.5 


Gfortran-4.4.5 


LAPACK 


LAPACK-3.3.0 


LAPACK-3.3.0 


BLAS 


GotoBLAS2-1.13 


GotoBLAS2-1.13 



of Type 2 is defined as follows: 



T = 



1 1 

1 1 1 
1 '•• 



■• 1 
1 1 



(5) 



All the eigenvalues of Type 2 matrix with large dimensions are included in the same 
cluster in the sense of Peters-Wilkinson method. Type 3 is the glued-Wilkinson matri- 



ces W^. consists of the block matrix W 2 + [ 
and is defined as follow: 



1x21 



and the scalar parameter 8 G ' 



W g 





8 






8 




8 






8 




8 






8 





(6) 



where is defined by 



21 



10 1 
1 9 

1 



(7) 



1 

1 10 



and 8 satisfies < 8 < 1 and is also the semi-diagonal element of W^. Since is real 
symmetric tri-diagonal and its semi-diagonal elements are nonzero, all the eigenvalues 
of are real and they are divided into 21 clusters of close eigenvalues. When 5 is 
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Table 3: Numerical results of DSTEIN and DSTEIN-cWY on Computer 1 (Type 1). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 


0.39 


1.76 


5.30 


17.4 


53.6 


157 


996 


2436 


4004 


13231 


^cwy [sec] 


0.41 


1.60 


3.77 


7.85 


13.7 


25.1 


115 


307 


449 


1291 


t i ^cwy 


0.94 


1.10 


1.41 


2.22 


3.90 


6.22 


8.64 


7.93 


8.93 


10.25 



Table 4: Numerical results of DSTEIN and DSTEIN-cWY on Computer 2 (Type 1). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 

^cwy [sec] 


0.16 
0.18 


0.75 
0.73 


2.13 
1.70 


6.41 
3.42 


19.2 
7.66 


58.3 
24.7 


372 
179 


889 
430 


1416 

703 


4357 
1933 


t Acwy 


0.91 


1.02 


1.25 


1.87 


2.51 


2.36 


2.08 


2.06 


2.01 


2.25 



Table 5: Numerical results of DSTEIN and DSTEIN-cWY on Computer 1 (Type 2). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 


1.73 


154 


448 


989 


1897 


3281 


5192 


7749 


10986 


14867 


?cwy [sec] 


0.45 


7.04 


28.1 


94.6 


167 


311 


476 


795 


1029 


1389 


t /?cwy 


3.85 


21.93 


15.94 


10.45 


11.34 


10.56 


10.92 


9.74 


10.68 


10.70 



Table 6: Numerical results of DSTEIN and DSTEIN-cWY on Computer 2 (Type 2). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 


0.52 


57.4 


171 


375 


688 


1143 


1774 


2570 


3586 


4884 


( cwy [sec] 


0.20 


12.2 


55.3 


136 


266 


462 


723 


1067 


1519 


2070 


t i ^cwy 


2.67 


4.69 


3.10 


2.75 


2.58 


2.48 


2.45 


2.41 


2.36 


2.36 



Table 7: Numerical results of DSTEIN and DSTEIN-cWY on Computer 1 (Type 3). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 


2.26 


11.5 


31.8 


72.9 


138 


230 


359 


526 


738 


986 


^cwy [sec] 


0.62 


2.49 


5.82 


10.9 


18.1 


28.4 


45.9 


74.5 


103 


141 


t / ^cwy 


3.66 


4.62 


5.47 


6.71 


7.66 


8.10 


7.82 


7.06 


7.18 


6.99 



Table 8: Numerical results of DSTEIN and DSTEIN-cWY on Computer 2 (Type 3). 



n 


1050 


2100 


3150 


4200 


5250 


6300 


7350 


8400 


9450 


10500 


t [sec] 


0.68 


3.58 


10.4 


24.5 


50.1 


86.8 


137 


203 


289 


393 


^cwy [sec] 


0.27 


1.10 


2.72 


6.59 


16.9 


35.7 


63.4 


103 


149 


209 


f Acwy 


2.54 


3.27 


3.83 


3.72 


2.97 


2.43 


2.16 


1.97 


1.94 


1.88 



small, the distance between the minimum and maximum eigenvalues in any cluster is 
small. In our experiments, we set 8 = 10~ 4 . Computing eigenvalues and eigenvec- 
tors of the glued- Wilkinson matrix is one of the benchmark problems of eigenvalue 
decomposition. For example, the glued- Wilkinson matrix was used to evaluate the 
performance of matrix eigenvalue algorithms J2] 01 • 

5.2. Results of the experiments 

Table |3]|8] show the results of the experiments on Computer 1 and 2 that are men- 
tioned in the previous section, In tables, n is the dimension of the experimental ma- 
trices, t and f cwy are computation time by DSTEIN and DSTEIN-cWY, respectively. 
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In addition, Fig. |20 illustrate the results in Tables and g]|5] and 011] and through 
graphs, respectively. In Fig. [2)0] the dotted line corresponds to t and the straight line 

tO ^cwy ■ 

It is noted that DSTEIN-cWY is faster than DSTEIN for any cases of the all types 
matrices, without the cases of Type 1 matrix for n = 1050. We see that the change from 
MGS to the compact WY orthogonalization on the DSTEIN code in parallel computing 
results in a significant reduction of computation time. We introduce a barometer t / t cv/y 
of the reduction effect by using the program DSTEIN-cWY which depends on n, the 
dimension of the experimental matrix. On Computer 1, the maximum value of a = 
t/t cvly is a = 10.25 for n= 10, 500 of Type 1, a = 10.92 for n = 7,350 of Type 2, and 
a = 8.10 for n = 6,300 of Type 3. On Computer 2, a = 2.51 for n = 5,250 of Type 1, 
a = 4.69 for n = 2, 100 of Type 2, and a = 3.83 for n = 3, 150 of Type 3. Considering 
these facts, even if the dimension of the experimental matrices is larger than that in 
these examples, we cannot expect that the computation time can be further shortened 
by using DSTEIN-cWY. 

5.3. Discussion on numerical experiments 

It is shown that DSTEIN-cWY is faster than DSTEIN for any dimension n of the 
experimental matrix both on Computers 1 and 2. As mentioned earlier, according 
to the theoretical background in Section 3.3, this result shows that the compact WY 
orthogonalization is an effective algorithm for parallel computing. 

The cause of this is related to the time required for floating-point arithmetic and for 
synchronization in parallel computing. The floating-point computation time increases 
with increasing the dimension n of matrices. In comparison, the synchronization cost 
does not change significantly even if n becomes larger. Therefore, in parallel com- 
puting, DSTEIN, which contains MGS (for which the number of synchronizations is 
large), creates a huge bottleneck for the synchronization cost when n is small. This 
bottleneck gradually becomes less when n is larger. However, DSTEIN-cWY has a 
smaller bottleneck for the synchronization cost because the compact WY orthogonal- 
ization requires less synchronization, and the floating-point computation time becomes 
greater than that of DSTEIN. This reduction effect can be seen in Table|3][8] 

6. Conclusions 

In this study, we present a new inverse iteration algorithm for computing all the 
eigenvectors of a real symmetric tri-diagonal matrix. The new algorithm is equipped 
with the new implementation of the compact WY orthogonalization algorithm, estab- 
lished in this paper, in the orthogonalization process. 

Now we use a new implementation of the compact WY orthogonalization. Intro- 
ducing this implementation, the computational cost of the compact WY orthogonaliza- 
tion can be reduced. 

We have given numerical experiments for computing eigenvectors of certain real 
symmetric tri-diagonal matrices that have many clusters with several thousand dimen- 
sions by using two types of inverse iteration algorithms on parallel computers. The 
results show that the compact WY inverse iteration is more efficient than the classical 
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Figure 2: Dimension n of Type 1 matrix and the computation time by DSTEIN and DSTEIN-cWY. the left 
graph corresponds to Computer 1 and the right Computer 2. 




1050 2100 3150 4200 5250 6300 7350 8400 9450 10500 
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Figure 3: Dimension n of Type 2 matrix and the computation time by DSTEIN and DSTEIN-cWY. the left 
graph corresponds to Computer 1 and the right Computer 2. 




1050 2100 3150 4200 5250 6300 7350 8400 9450 10500 1050 2100 3150 4200 5250 6300 7350 8400 9450 10500 



Figure 4: Dimension n of Type 3 matrix and the computation time by DSTEIN and DSTEIN-cWY. the left 
graph corresponds to Computer 1 and the right Computer 2. 



one owing to the reduction in computation time because of the parallelization effi- 
ciency. As the number of cores of the CPU increases, the parallelization efficiency 
increases. 

It may be expected to apply the new inverse iteration algorithms to other types of 
matrix eigenvector problem, such as eigenvectors of a real symmetric band matrix, or 
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singular vectors of a bidiagonal matrix. 
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